Avoid generation of original SMILES in augmentation #136

aditya0by0 · 2025-12-06T12:21:09Z

See Wiki https://github.com/ChEB-AI/python-chebai/wiki/SMILES-Augmentation#snippet-of-augmented-smiles
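The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the idea in the title: keep the original SMILES once and drop any augmented variant that repeats it. The helper name `merge_without_duplicate_original` is hypothetical and not taken from python-chebai.

```python
def merge_without_duplicate_original(original: str, variants: list[str]) -> list[str]:
    """Return the original SMILES followed by the distinct variants that differ from it.

    Hypothetical helper for illustration only; not the code changed in this PR.
    """
    seen = {original}
    merged = [original]
    for smi in variants:
        if smi not in seen:  # skips repeated variants and any variant equal to the original
            seen.add(smi)
            merged.append(smi)
    return merged


result = merge_without_duplicate_original("CCO", ["OCC", "CCO", "C(C)O", "OCC"])
print(result)  # ['CCO', 'OCC', 'C(C)O']
```

Keeping the order stable (original first, variants in generation order) makes the output reproducible for a fixed random seed.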

sfluegel05

That makes sense. I am wondering why we need the original SMILES at all. Why don't we only do the random SMILES generation (which then might include the original SMILES, but not necessarily)?

aditya0by0 · 2025-12-08T19:31:18Z

The motivation for including the original SMILES strings in the augmented dataset comes from two considerations:

1. Analogy to data augmentation in computer vision
   In the vision domain, it is standard practice to train CNN models on both the original images and their augmented variants (rotations, flips, scaling, color jitter, etc.). The original samples provide a stable reference distribution, while the augmented samples improve robustness.
   By analogy, including the original SMILES ensures that the model stays exposed to the true data distribution of ChEBI, rather than relying solely on augmented variants.
2. Chemical-domain constraints on SMILES generation
   A more domain-specific reason is that some SMILES strings in ChEBI can be parsed by RDKit but cannot be reproduced character for character by RDKit's SMILES writer.
   This happens because RDKit handles certain structural representations differently.
   For example, by default RDKit turns explicit hydrogen atoms into implicit ones when parsing and normalizes other parts of the representation when writing. As a result, ChEBI SMILES that spell out hydrogens explicitly or use uncommon notations will not appear among the augmented outputs generated by RDKit.
   Therefore, these original SMILES must be included in the training set: they represent valid chemical structures found in ChEBI but would otherwise be lost during augmentation.
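The hydrogen handling described in the second point can be seen on a much smaller molecule. This is a minimal sketch using RDKit's default parsing settings; `roundtrip` is a helper name introduced here for illustration.

```python
from rdkit import Chem


def roundtrip(smiles: str) -> str:
    """Parse a SMILES string with RDKit defaults and write it back out."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse {smiles!r}")
    return Chem.MolToSmiles(mol)


# Water and ammonia written with explicit hydrogen atoms.
# Default parsing folds the hydrogens into the heavy atom, so the
# written SMILES no longer contains any [H] tokens.
print(roundtrip("[H]O[H]"))       # O
print(roundtrip("[H]N([H])[H]"))  # N
```

Because the explicit `[H]` atoms are gone after parsing, no amount of randomization in the writer can bring the original spelling back.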

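Below is an example illustrating such a case:

| ident | name | SMILES |
|---|---|---|
| 32129 | diamminesilver(1+) fluoride | `[F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H]` |

The SMILES generated for it are listed here: https://github.com/ChEB-AI/python-chebai/wiki/SMILES-Augmentation#snippet-of-augmented-smiles

The program below supports this: it samples a large number of random SMILES per fragment, combines them, and checks whether the original SMILES is among the results.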

import random
from itertools import cycle, permutations, product

from rdkit import Chem

# Number of random SMILES sampled per fragment.
# This is very large and makes the script slow; lower it for a quick check.
AUG_SMILES_VARIATIONS = 1_000_000_000


def generate_augmented_smiles(smiles: str) -> set[str]:
    """Return the set of randomized SMILES that RDKit generates for `smiles`."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {smiles}  # unparsable input: fall back to the original SMILES

    # sanitizeFrags=False, as sanitization can alter the fragment representation.
    # We don't want RDKit to "fix" fragments; we only need them as-is to generate SMILES strings.
    frags = Chem.GetMolFrags(mol, asMols=True, sanitizeFrags=False)

    frag_smiles: list[set[str]] = []
    for frag in frags:
        atom_ids = [atom.GetIdx() for atom in frag.GetAtoms()]
        random.shuffle(atom_ids)  # in the training pipeline the seed is set by lightning
        atom_id_iter = cycle(atom_ids)  # rotate the root atom across samples
        frag_smiles.append(
            {
                Chem.MolToSmiles(frag, rootedAtAtom=next(atom_id_iter), doRandom=True)
                for _ in range(AUG_SMILES_VARIATIONS)
            }
        )

    if len(frags) == 1:
        return frag_smiles[0]

    # Combine the per-fragment variants in every possible fragment order.
    return {
        ".".join(combo)
        for perm in permutations(frag_smiles)
        for combo in product(*perm)
    }


if __name__ == "__main__":
    test_smiles = "[F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H]"
    if test_smiles in generate_augmented_smiles(test_smiles):
        print("Found original SMILES in augmented set.")
    else:
        print("Original SMILES NOT found in augmented set.")

sfluegel05 · 2025-12-10T12:50:33Z

I would argue that the computer vision approach cannot be directly applied to SMILES. In computer vision, there is usually a "real" image and "distorted / modified" versions of the image. In SMILES, I don't see a clear distinction why the ChEBI-provided SMILES should be the "real" one that needs to be kept - all other SMILES strings are just as valid. The question then becomes: Which kind of SMILES do our users use? Is it the ChEBI-style SMILES? Is it the RDKit-canonical SMILES? I don't know.
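To illustrate the point that no single spelling is privileged, here is a small sketch (not code from this repository) that parses three different SMILES strings for ethanol and shows that RDKit maps all of them to the same canonical form.

```python
from rdkit import Chem

# Three valid spellings of ethanol.
spellings = ["CCO", "OCC", "C(O)C"]

# Canonicalization collapses them into one string, so none of the
# inputs is more "real" than the others from RDKit's point of view.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in spellings}
print(canonical)  # {'CCO'}
```

Which spelling the model should see most often therefore depends on what users are expected to submit, not on chemical validity.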

Fun fact: The reason why we can't reproduce the ChEBI-style SMILES is that they have their own library for that: https://github.com/chembl/libRDChEBI (currently, I don't see a use case where we would need to reproduce ChEBI's exact SMILES, but if it arises, we should use this library).

Anyway, since I don't have a better suggestion, let's keep the original ChEBI-SMILES for now.

aditya0by0 added 2 commits December 6, 2025 12:54

avoid generation of original smiles in augmentation

2218fc8

pre-commit formatting

df02f55

aditya0by0 self-assigned this Dec 6, 2025

aditya0by0 added the bug and bug:fix labels and removed the bug label Dec 6, 2025

aditya0by0 linked an issue Dec 6, 2025 that may be closed by this pull request

SMILES augmentation #113

Closed

aditya0by0 requested a review from sfluegel05 December 6, 2025 12:24

sfluegel05 reviewed Dec 8, 2025

aditya0by0 requested a review from sfluegel05 December 8, 2025 19:32

sfluegel05 merged commit 8c5ebcd into dev Dec 10, 2025
5 checks passed

sfluegel05 deleted the fix/duplicate_smiles_aug branch December 10, 2025 12:50
